```r
# Function to create symlinks from the original data files on /active to the
# current working directory of the R project.
mk_symlinks <- function(linked_dirname, filepaths) {
  dir.create(linked_dirname, recursive = TRUE, showWarnings = FALSE)
  lapply(filepaths, function(file) {
    target <- file.path(linked_dirname, basename(file))
    if (!file.exists(target)) {
      command <- glue::glue("ln -svf '{file}' '{target}'")
      system(command)
    }
  })
}
```

The pipeline runs quality trimming of reads with Trim Galore, Bowtie2 alignment, SEACR peak calling, and optionally MACS2 peak calling. It also computes general QC statistics on the fastqs with FastQC and on the alignments. Finally, the QC reports are collected into a single report using MultiQC.
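Under the hood, `mk_symlinks` just shells out to `ln`. The equivalent operation directly in bash (a self-contained sketch with illustrative paths) is:

```bash
# demo: create a fake "original" data file, then symlink it into ./linked_data
# (paths are illustrative; -s symbolic, -v verbose, -f replace an existing link)
mkdir -p /tmp/demo_src linked_data
echo "example" > /tmp/demo_src/sample1.fastq.gz
for file in /tmp/demo_src/*; do
  target="linked_data/$(basename "$file")"
  # skip files that are already linked, mirroring the !file.exists() guard in R
  [ -e "$target" ] || ln -svf "$file" "$target"
done
```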
A DAG (directed acyclic graph) of the workflow is shown below:
First, fork the repository from the Children's Bitbucket. Do this by clicking the "create fork" symbol in the Bitbucket web interface and fork it to your personal Bitbucket account, as illustrated below.
The figures are from another pipeline, but the steps for forking the repository and selecting the appropriate version are the same.
Next, clone your personal repository to your home directory on Cybertron. See the image below for where to find the correct URL on your forked Bitbucket repo.
Copy that URL to replace the one below.
```bash
# on a terminal on the Cybertron login nodes
cd ~
# your fork should have your own userID (rather than jsmi26)
git clone https://childrens-atlassian/bitbucket/scm/~jsmi26/cutandrun_nf.git
cd ~/cutandrun_nf
```

Once inside the code repository, use the latest release branch, or make sure you're using the same release as a prior analysis, by using git.
```bash
git fetch
git branch -a
```

The `git branch -a` command lists all available branches, including remote branches, like:

```
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/dev
  remotes/origin/feature/sample_sheet_val
  remotes/origin/main
  remotes/origin/release/1.0.0
  remotes/origin/release/1.1.0
```
Check out the release branch using this command:

```bash
git checkout release/1.1.0
```

Git will state that you are now on the release/1.1.0 branch and that it is tracking the release branch in your personal repository:

```
Checking out files: 100% (55/55), done.
Branch release/1.1.0 set up to track remote branch release/1.1.0 from origin.
Switched to a new branch 'release/1.1.0'
```
Optional: use tmux on the Cybertron login nodes. Name the session nextflow, then request an interactive session before activating the nextflow conda environment. The project codes can be found with the `project info` command.
Change the QUEUE and NAME variables in the code chunk below to be accurate for your Cybertron projects.

```bash
tmux new-session -s nextflow
NAME="RSC_adhoc"
QUEUE="sceaq"
qsub -I -q $QUEUE -P $(project code $NAME) -l select=1:ncpus=1:mem=8g -l walltime=8:00:00
```

Next, for the conda environment to be solved, you will need to set channel_priority to flexible in your conda configs as well. To read more about conda environments and their configuration, check out the documentation.

```bash
conda config --describe channel_priority # print your current conda settings
conda config --set channel_priority flexible # set to flexible if not already done
```

Then create the conda environment that has the appropriate software, including nextflow, nf-core, and graphviz. The conda environment only needs to be built once; afterwards, simply use `conda activate nextflow`.

```bash
conda env create -f env/nextflow.yaml
conda activate nextflow
```

A sample sheet in CSV (comma-separated values) format is used as input to the pipeline. This sample sheet must have the following column names, in any order:
Below is an example of a complete sample sheet for use in the pipeline, which can be edited for your own samples in test_data/test_dataset_sample_sheet.csv.

```r
sample_sheet <- read.csv(here::here("test_data/test_dataset_sample_sheet.csv")) %>%
  mutate(single_end = "false") %>%
  mutate(sample = str_split_fixed(sample_id, pattern = "_", n = 3)[, 1]) %>%
  select(sample, everything())

head(sample_sheet)
# dim(sample_sheet)
```

To ensure that the pipeline works, first run the test data set. This example will run using the data found in test_data/test_dataset_sample_sheet.csv.
```bash
./main_run.sh "test_dataset"
```

Open the configuration file nextflow.config and edit the necessary parameters for building the index and/or running the alignment or peak calling steps.
## //working directory for temporary/intermediate files produced in the workflow processes
## workDir = "$HOME/temp"
##
## //global parameters
## params {
## // general options
## sample_sheet = "./test_data/test_dataset_sample_sheet.csv"
## queue = 'sceaq'
## project = '207f23bf-acb6-4835-8bfe-142436acb58c'
## outdir = "./results/mouse"
## peaks_outdir = "${params.outdir}/peaks_calls"
## publish_dir_mode = 'copy'
##
## //Bowtie params
## // description: fasta OR the bowtie index path are required. If only fasta provided, set build_index to true
## build_index = false
## fasta = '/gpfs/shared_data/Bowtie2/mm39.fa'
## index = '/gpfs/shared_data/Bowtie2/mm39_index'
## save_unaligned = false
##
## <...>
Be sure to change the following lines for the global parameters:
## //global parameters
## params {
## // general options
## sample_sheet = "./test_data/test_dataset_sample_sheet.csv"
## queue = 'sceaq'
## project = '207f23bf-acb6-4835-8bfe-142436acb58c'
## outdir = "./results/mouse"
## peaks_outdir = "${params.outdir}/peaks_calls"
## publish_dir_mode = 'copy'
Additionally, determine whether a new Bowtie2 index needs to be built for the target genome and/or the spike-in genome. The pipeline requires either a fasta filepath OR a Bowtie2 index filepath. The same is required for the spike-in, with E. coli provided as a default.
E. coli is the default because it is carry-over DNA from the Cut&Run library prep methodology and is expected to be present in all Cut&Run experiments, regardless of whether an exogenous spike-in such as yeast is used. Please see here for more information on spike-in normalization.
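For context, spike-in normalization typically converts each sample's spike-in aligned read count into a scaling factor (scale = C / spike-in reads, for some fixed constant C). The sketch below uses illustrative numbers and is conceptual, not a step the pipeline exposes under this name; the read count would normally come from something like `samtools view -c -F 0x4 spikein.bam`:

```bash
# conceptual sketch: derive a per-sample spike-in scaling factor
C=10000            # arbitrary fixed constant, same for all samples
spike_reads=2500   # illustrative spike-in aligned read count
scale=$(awk -v c="$C" -v n="$spike_reads" 'BEGIN { printf "%.4f", c / n }')
echo "$scale"
```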
Change the following lines for alignment reference files when needed:
## //Bowtie params
## // description: fasta OR the bowtie index path are required. If only fasta provided, set build_index to true
## build_index = false
## fasta = '/gpfs/shared_data/Bowtie2/mm39.fa'
## index = '/gpfs/shared_data/Bowtie2/mm39_index'
## save_unaligned = false
##
## //spike-in params
## // description: fasta OR index path are required. If only fasta provided, set build_spike_index to true
## build_spike_index = false
## spike_fasta = '/gpfs/shared_data/Bowtie2/GCF_000005845.2_ASM584v2_genomic.fa'
## spike_index = '/gpfs/shared_data/Bowtie2/ecoli_index'
Finally, decide whether to run MACS2 peak calling alongside the SEACR peak calling algorithm (default = true). MACS2 will use the effective genome size value provided in the gsize parameter.
If you are using a non-model organism, or simply don't want to use the effective genome size provided in the literature or the MACS2 documentation, you can set run_khmer = true to calculate an effective genome size using the target genome fasta filepath and read length.
## //Effective genome size
## gsize = 1.87e9 //should NOT be scientific notation to be used with deeptools
## <...>
## //MACS2 and khmer params
## run_macs2 = true
## run_khmer = false //if true, will override the value in gsize parameter
## kmer_size = 150 //kmer_size is the read-length
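An effective genome size derived this way is essentially the number of distinct k-mers (k = read length) in the genome. As a toy illustration, here is an exact count on a short string; khmer performs a probabilistic estimate of the same quantity over the whole fasta:

```bash
# count the distinct k-mers of length k in a toy sequence
seq="ACGTACGTGG"
k=4
for ((i = 0; i <= ${#seq} - k; i++)); do
  echo "${seq:i:k}"      # emit each k-length window
done | sort -u | wc -l   # keep unique windows, count them
```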
In the nextflow.config, you can define additional command line arguments for the scientific software under the process scope. You may use the advanced options to change the computational resources requested for different processes. The CPUs and memory parameters can be updated to request more resources, such as additional CPUs or memory, if files are large. You may also edit the command line parameters for processes in the workflow using the ext.args directive.
Please be aware that default command line parameters for the Bowtie2 processes are already provided for both target and spike-in alignment, but they can be edited.
The most commonly modified and important process parameters are listed toward the top of the process scope in the nextflow.config file.
You can edit the command line parameters for SEACR and MACS2, which often need to be re-run multiple times when deciding on the appropriate peak set to use. For example, MACS2 broad and narrow peak calling parameters for different histone modifications can be modified using the ext.args parameter.
## // Computational resource allocation for the processes run in the workflow
## process {
##   //Bowtie2 aligner process specific parameters
##   withName: BOWTIE2_ALIGN {
##     cpus = { 2 * task.attempt }
##     memory = { 32.GB * task.attempt }
##     ext.args = '--local --very-sensitive-local --no-unal --no-mixed --no-discordant --phred33 -I 10 -X 700'
##     ext.args2 = '' //command line arguments for `samtools sort`
##   }
##   //SEACR peak calling resources
##   withName: SEACR_CALLPEAK {
##     cpus = { 1 * task.attempt }
##     memory = { 16.GB * task.attempt }
##     ext.version = '1.4' //version 1.3 and 1.4 supported
##     ext.args = '--normalize norm --mode stringent --remove yes'
##     publishDir = [...]
##   }
##   //MACS2 peak calling resources
##   withName: MACS2_CALLPEAK {
##     cpus = { 1 * task.attempt }
##     memory = { 16.GB * task.attempt }
##     ext.args = '-q 0.01 --keep-dup all'
##     publishDir = [...]
##   }
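For example, broad histone marks such as H3K27me3 are commonly called with MACS2's broad mode. A hypothetical ext.args override is sketched below; the cutoff value is illustrative, so consult the MACS2 documentation before using it:

```groovy
// sketch: MACS2 broad peak calling for a broad histone mark (values illustrative)
withName: MACS2_CALLPEAK {
    ext.args = '--broad --broad-cutoff 0.1 --keep-dup all'
}
```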
SEACR can be set to v1.4 or v1.3, which have notably different command line interfaces and changes in the methods for normalization to IgG; v1.4 can optionally remove peaks found in IgG. Please see here for the full changelog.
For SEACR v1.3, you will often need to change the SEACR normalization argument from "non" to "norm" depending on your normalization strategy, whether you're using IgG normalization or spike-in normalization. The example below demonstrates how to change the command line params and version by editing ext.version and ext.args.
## //SEACR peak calling resources
## withName: SEACR_CALLPEAK {
## cpus = { 1 * task.attempt }
## memory = { 16.GB * task.attempt }
## ext.version = '1.3' //version 1.3 and 1.4 supported
## ext.args = 'norm stringent'
## publishDir = [...]
##
## }
```r
usethis::edit_file(here::here("main_run.sh"))
```

```
## • Edit '/active/taylor_s/people/jsmi26/RPDEV/cutandrun_nf/main_run.sh'
```
Decide on the NFX_PROFILE, which allows you to run the processes either locally or using the PBS job scheduler on Cybertron, and determine whether you'd like to use singularity containers or docker containers.

- PBS_singularity [DEFAULT, recommended]
  - submits a PBS job that will use singularity containers on Cybertron
  - takes care of requesting the appropriate resources using PBS
- local_singularity
  - runs locally on an interactive session on Cybertron with singularity
  - requires that appropriate computational resources be requested using `qsub -I -q <queue_name> -P <project_code> -l select=1:ncpus=4:mem=32GB`
Edit the script main_run.sh and change the value of the NFX_PROFILE variable if desired.
## #Options: 'local_singularity', 'PBS_singularity'
## NFX_PROFILE='PBS_singularity'
Edit the variables in the main_run.sh script for the entry-point of the workflow. The default option "align_call_peaks" for NFX_ENTRY will run the full pipeline (QC, alignment, peak calling, coverage tracks).
## #Options: 'bowtie2_index_only', 'align_call_peaks', 'call_peaks'
## NFX_ENTRY='align_call_peaks'
If you already have aligned BAM files, see
test_data/test_dataset_bams_sample_sheet.csv for an example
to call peaks only using the entry call_peaks.
## #Options: 'bowtie2_index_only', 'align_call_peaks', 'call_peaks'
## NFX_ENTRY='call_peaks'
Then, execute the main_run.sh script in order to complete the peak calling on the samples. Provide a short, descriptive prefix for the pipeline run.

```bash
./main_run.sh "my_analysis"
```

You can also change the entry-point of the workflow by setting the NFX_ENTRY variable in the main_run.sh script to bowtie2_index_only. This runs only the Bowtie2 index build process and exits upon completion of the index building step.
## #Options: 'bowtie2_index_only', 'align_call_peaks', 'call_peaks'
## NFX_ENTRY='bowtie2_index_only'
```bash
./main_run.sh "bowtie2_index"
```

Under the path provided in the nextflow config for the "outdir" param, you will find directories named for each of the modules. Let's say params.outdir = ./results/mouse and peaks_outdir = "${params.outdir}/peak_calls". There will be the following file structure:
```
results/mouse/
  bamtobedgraph/
  bowtie2_align/
  deeptools_bamcoverage/
  fastqc/
  fastqc_trim/
  multiqc/
  peak_calls/
  picard_markduplicates/
  samtools_index/
  samtools_nsort/
  samtools_sort/
  spikein_align/
  trimgalore/
```
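Assuming the default paths above, a quick sanity check that each module wrote output is to list the first-level directories under the results path (a sketch; the exact set depends on the entry-point and options used):

```bash
# list first-level module output directories under the results path
find results/mouse -mindepth 1 -maxdepth 1 -type d | sort
```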
In addition, there will be an HTML report with information on where the temp data is stored in the workDir path, and general run statistics, such as resources utilized versus requested, which helps with optimization. It will also provide information on how much walltime was used per sample, total CPU hours, etc.
The HTML file is found in the reports directory and will have the prefix defined on the command line when ./main_run.sh "my_analysis" was invoked, so in this example it would be named "my_analysis_{DATE}.html".
There will also be a detailed nextflow log file that is useful for debugging, which in this example would be named "my_analysis_{DATE}_nextflow.log".
Finally, the pipeline will produce a DAG (directed acyclic graph), which describes the workflow channels (inputs) and the modules. The DAG image will be saved under the dag/ directory with the name "my_analysis_{DATE}_dag.pdf".